Compression of Unicode Files

Authors

  • Peter M. Fenwick
  • Simon Brierley
Abstract

The increasing importance of Unicode for text files, for example with Java and in some modern operating systems, implies a possible doubling of data storage space and data transmission time, with a corresponding need for data compression. However, it is not clear that data compressors designed for 8-bit byte data are well matched to 16-bit Unicode data. This paper investigates the compression of Unicode files, using a variety of established data compressors on a mix of genuine and artificial Unicode files. It is found that, while Ziv-Lempel and unbounded-context compressors work well, finite-context compressors are less satisfactory on Unicode. Tests with a simple special compressor intended for 16-bit data show that it may be useful to design compressors specifically for Unicode files.
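The doubling of storage described in the abstract can be illustrated with a minimal sketch. It assumes Python's standard `zlib` module (DEFLATE, a Ziv-Lempel family compressor) as a stand-in for a byte-oriented compressor; it is not necessarily one of the compressors tested in the paper. The function name `compare_sizes` and the sample text are hypothetical, chosen only for illustration.

```python
import zlib


def compare_sizes(text: str) -> dict:
    """Return raw and zlib-compressed sizes of `text` as 8-bit and 16-bit data.

    Assumption: every character fits in Latin-1, so the 16-bit form is
    exactly two bytes per character (UTF-16 little-endian, no byte-order mark).
    """
    narrow = text.encode("latin-1")    # one byte per character
    wide = text.encode("utf-16-le")    # two bytes per character
    return {
        "raw_8bit": len(narrow),
        "raw_16bit": len(wide),
        "zlib_8bit": len(zlib.compress(narrow, 9)),
        "zlib_16bit": len(zlib.compress(wide, 9)),
    }


# Hypothetical sample: repetitive English text, not data from the paper.
sample = "The quick brown fox jumps over the lazy dog. " * 200
sizes = compare_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>10}: {size} bytes")
```

The raw 16-bit size is exactly twice the raw 8-bit size for this input. How closely the two compressed sizes agree depends on the compressor and on the text, which is the kind of question the paper examines.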

Similar resources

A survey of Unicode compression

The Unicode (ISO/IEC 10646) coded character set is the largest of its kind. Almost a million code positions are available in Unicode for formal character encoding, with more than 137,000 additional code positions reserved for private-use characters. This is quite a change from the 128 or 256 characters available in 8-bit “legacy” code pages, or even the thousands available in East Asian double...

Arabic Text Steganography Using Unicode of Non-Joined to Right Side Letters

Steganography is a technique for hiding data in media in a way that makes its existence hard to detect. Text files are a preferable format for use in steganography due to the small storage size of such files. This paper presents an Arabic text steganographic algorithm based on Unicode. The algorithm imposes a minimal change on connected le...

Performance Improvement Of Bengali Text Compression Using Transliteration And Huffman Principle

In this paper, we propose a new compression technique based on transliteration of Bengali text to English. Compared to Bengali, English is a less symbolic language. Thus, transliteration of Bengali text to English reduces the number of characters to be coded. Huffman coding is well known for producing optimal compression. When the Huffman principle is applied to transliterated text, significant perfo...

An Image Lossless Compression Patent

Existing general-purpose lossless compression algorithms are not effective on JPEG files. In this article, a lossless compression method that combines a shuffling algorithm with a lossless compression algorithm is proposed, together with a new shuffling algorithm. The new algorithm can compress JPEG files without loss, and the result indicates that this algorithm can ...

Accordion Arrays: Selective Compression of Unicode Arrays in Java

In this work, we present accordion arrays, a straightforward and effective memory compression technique targeting Unicode-based character arrays. In many non-numeric Java programs, character arrays represent a significant fraction (30-40% on average) of the heap memory allocated. In many locales, most, but not all, of those arrays consist entirely of characters whose top bytes are zeros, and, h...

Publication date: 1998